Stemming and Segmentation for Classical Tibetan

Authors

  • Orna Almogi
  • Lena Dankin
  • Nachum Dershowitz
  • Yair Hoffman
  • Dimitri Pauls
  • Dorji Wangchuk
  • Lior Wolf
Abstract

Tibetan is a monosyllabic language for which computerized language tools are largely lacking. We describe the development of a syllable stemmer for Tibetan. The stemmer is based on a set of rules that strive to identify the vowel, the core letter of the syllable, and then the other parts. We demonstrate the value of the stemmer with two applications: determining stem similarity of two syllables and word segmentation. Our stemmer is being made available as an open-source tool and word segmentation as a freely available online tool.

"It is worthy of remark that a tongue which in its nature was monosyllabic, when written in the characters of a polysyllabic language like the Sanskrit, had necessarily to undergo some modification." Sarat Chandra Das, “Life of Sum-pa mkhan-po, also styled Ye-śes dpal-’byor, the author of Rehumig (Chronological Table)”, Journal of the Asiatic Society of Bengal (1889)
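The rule order named in the abstract (vowel first, then the core letter, then the remaining parts) can be illustrated with a small sketch. It is not the authors' rule set: it assumes syllables are given in Wylie transliteration, uses a deliberately simplified heuristic for prefixes, superscripts, and subjoined letters, and does not handle fused particles such as the genitive in "pa'i". The names `tokenize`, `stem`, and `stem_similar` are hypothetical.

```python
VOWELS = "aeiou"
# Wylie consonants written with more than one Latin letter, longest first.
MULTI = ("tsh", "kh", "ng", "ch", "ny", "th", "ph", "ts", "dz", "zh", "sh")


def tokenize(onset):
    """Split a Wylie onset into consonant tokens (greedy, longest match first)."""
    tokens, i = [], 0
    while i < len(onset):
        match = next((m for m in MULTI if onset.startswith(m, i)), onset[i])
        tokens.append(match)
        i += len(match)
    return tokens


def stem(syllable):
    """Return a simplified stem: core letter + subjoined + vowel + first suffix."""
    # Step 1: locate the vowel (explicit in Wylie).
    pos = next((i for i, ch in enumerate(syllable) if ch in VOWELS), None)
    if pos is None:
        return syllable  # no vowel found; leave the input untouched
    onset, vowel, coda = syllable[:pos], syllable[pos], syllable[pos + 1:]

    # Step 2: find the core letter. Wylie "g.y" marks a prefix before root y,
    # so anything before a period is treated as a prefix.
    tokens = tokenize(onset.split(".")[-1])
    subjoined = ""
    if len(tokens) > 1 and tokens[-1] == "w":
        subjoined = tokens.pop()
    if len(tokens) > 1 and tokens[-1] in ("y", "r", "l"):
        subjoined = tokens.pop() + subjoined
    root = tokens[-1] if tokens else ""  # earlier tokens are prefix/superscript

    # Step 3: other parts. Drop a post-suffix s or d, keep the first suffix.
    if len(coda) > 1 and coda[-1] in "sd":
        coda = coda[:-1]
    return root + subjoined + vowel + coda


def stem_similar(a, b):
    """Assumed definition: two syllables are stem-similar if their stems match."""
    return stem(a) == stem(b)
```

Under these assumptions, `stem("bsgrubs")` and `stem("sgrub")` both reduce to `grub`, so the two syllables compare as stem-similar, while `dpal` and `mkhan` do not.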

Similar resources

A Hackathon for Classical Tibetan

We describe the course of a hackathon dedicated to the development of linguistic tools for Tibetan Buddhist studies. Over a period of five days, a group of seventeen scholars, scientists, and students developed and compared algorithms for intertextual alignment and text classification, along with some basic language tools, including a stemmer and word segmenter. keywords Tibetan; hackathon; ste...

Research on Tibetan Automatic Word Segmentation

This paper studies Tibetan automatic word segmentation. We focus on three key technologies of Tibetan automatic word segmentation: (1) a Tibetan automatic word segmentation approach is proposed, which takes advantage of case-auxiliary words and continuous features. (2) a resolution method of overlapping ambiguity in Tibetan word segmentation is proposed, which is based on forward-b...

Perceptual evaluation of models for music segmentation

Background in music perception and cognition. Stemming from the seminal work of Lerdahl and Jackendoff (1983), a number of studies have examined the relevance of musicological rules and elements to the perceptual structure in music (Deliege, 1987; Clark and Krumhansl, 1990; Frankland and Cohen, 2004). While certain cues and rules have been shown to be related to perceptual segmentation, the foc...

Tibetan Unknown Word Identification from News Corpora for Supporting Lexicon-based Tibetan Word Segmentation

In Tibetan, as words are written consecutively without delimiters, finding the boundaries of unknown words is difficult. This paper presents a hybrid approach for Tibetan unknown word identification for offline corpus processing. Firstly, Tibetan named entities are preprocessed based on natural annotation. Secondly, other Tibetan unknown words are extracted from word segmentation fragments using MTC, the co...

Tibetan Number Identification Based on Classification of Number Components in Tibetan Word Segmentation

Tibetan word segmentation is essential for Tibetan information processing. At present, Tibetan words are mainly segmented with a basic dictionary-based machine matching method, because no segmented Tibetan corpus is available for training. But the dictionary-based method is not suited to Tibetan number identification. This paper studies...

Publication date: 2016